eval: record scifact + cmteb public-anchor calibration (Milestone B)#9

Merged
helebest merged 1 commit into main from eval/anchor-calibration
Jun 28, 2026

@helebest

Collaborator

Milestone B, rebased onto main (codex/M3). Records scifact + cmteb public-anchor calibration in reports/BASELINES.md. Supersedes #6 (auto-closed when #5's base branch was deleted during the stacked squash-merge).

Retrieval baselines are embedder-driven (Gitee Qwen3-Embedding-0.6B@1024, unchanged by the M2.7→M3 LLM bump); the LLM is unused for retrieval-only anchor eval.

🤖 Generated with Claude Code

First real-vector baselines, validating the eval chain end-to-end on
non-saturated, literature-anchored data (the Phase 0→1 advance criterion).

- scifact (en): doc/hybrid ndcg_at_10 0.689 — matches BEIR literature 0.67
  (Δ 0.019, within ±0.10) and clears dikw-core's committed floor. RRF lift real
  (bm25 0.651 < vector 0.673 < hybrid 0.689).
- cmteb-t2-subset (zh): reproduces dikw-core's calibrated subset baseline within
  noise (bm25 ndcg_at_10 0.840 exact; hybrid 0.946 vs 0.952; vector 0.943 vs
  0.942); clears all dataset thresholds. jieba CJK confirmed (0.840, not the
  degenerate 0.031). NB: this is a curated 300-q subset, not the full CMTEB
  leaderboard (~0.50) — see its dataset.yaml.
- Cross-cutting: multi-batch embedding confirmed (5183 / 5000 chunks ≫ 16/batch,
  the Phase 0 gap); read-only held (scifact materialized into gitignored paths,
  dataset.yaml restored, dikw-core tree clean).
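
For readers unfamiliar with the fusion step behind the hybrid numbers above, here is a minimal sketch of Reciprocal Rank Fusion (RRF) over a BM25 ranking and a vector ranking. It is illustrative only and is not taken from dikw-core: the function name `rrf_fuse`, the toy document IDs, and the constant `k=60` (the value commonly used in the RRF literature) are all assumptions, and the real pipeline may weight or truncate rankings differently.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical toy rankings, not real eval output.
bm25_ranking = ["d3", "d1", "d2"]
vector_ranking = ["d1", "d4", "d3"]
fused = rrf_fuse([bm25_ranking, vector_ranking])
print(fused)
```

Documents ranked well by both retrievers (here `d1` and `d3`) rise above documents seen by only one, which is the mechanism that can put hybrid ahead of either single retriever.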

No gates set — calibration only; per-language thresholds wait for the in-house
domain-bilingual-v1 / negatives-ood-v1 sets.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@helebest helebest merged commit 6b16095 into main Jun 28, 2026
2 checks passed
@helebest helebest deleted the eval/anchor-calibration branch June 28, 2026 13:00